@gabrielapgomezji gabrielapgomezji commented Nov 24, 2025

This PR addresses #1728

The ToFloat transformer now includes a decimal parameter that lets the user specify the decimal separator to use for the given column. Then, all the possible thousands separators are removed, and the decimal separator is converted to a . before the column is passed to to_float32.
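As a rough illustration of the preprocessing described above, here is a minimal pure-Python sketch. The helper name `normalize_number_string` is hypothetical, and the list of candidate separators is taken from the docstring quoted later in this thread. The PR itself works on dataframe columns before calling to_float32, so this is not the actual implementation.

```python
# Hypothetical helper for illustration only, not skrub's implementation.
# Candidate thousands separators: comma, dot, space, apostrophe.
CANDIDATE_SEPARATORS = (",", ".", " ", "'")


def normalize_number_string(value, decimal="."):
    """Drop every candidate separator except `decimal`, then use '.' as decimal."""
    for sep in CANDIDATE_SEPARATORS:
        if sep != decimal:
            value = value.replace(sep, "")
    return value.replace(decimal, ".")


print(float(normalize_number_string("1.234,56", decimal=",")))  # 1234.56
print(float(normalize_number_string(",56", decimal=",")))  # 0.56
```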

@rcap107 rcap107 changed the title WIP: Adding decimal conversion and tests FEAT - Adding decimal as parameter for ToFloat32 Nov 24, 2025
@gabrielapgomezji gabrielapgomezji marked this pull request as ready for review December 1, 2025 14:03
(",56", 0.56, ","),
],
)
def test_number_parsing(input_str, expected_float, decimal, df_module):
Contributor

Might it be worth adding tests for the code's behaviour in case of an invalid entry?

Member

yes, we should check a few weird cases and make sure they fail as expected

rcap107 commented Dec 2, 2025

After some discussion, I think this PR needs some more time before it can be merged, and unfortunately won't be part of the next release.

The current implementation removes all thousands separators other than what is specified as the "decimal" separator, which is quite risky and may lead to problems. It's better to follow what pandas does, i.e., have separate decimal and thousands parameters. By default, the thousands separator should be None (so no replacement).

If there is some kind of weird string like 1,2.3,4, it should not be parsed as a number. I am not sure how far we should go to parse something like 1,2.34 with decimal . and thousands ,: it's not a format I recognize, but it would still be recognized as 12.34 rather than being rejected.

Another check that may be considered is counting the number of decimal separators and rejecting any case where there is more than one.
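A minimal sketch of the behaviour proposed in this comment, in plain Python. The function name `parse_number` and the choice to return None for rejected strings are assumptions made for illustration; the real transformer works on whole columns and raises RejectColumn instead.

```python
def parse_number(value, decimal=".", thousand=None):
    """Return a float, or None when the string should be rejected.

    Hypothetical helper: `thousand=None` means no separator is removed.
    """
    # More than one decimal separator can never be a valid number.
    if value.count(decimal) > 1:
        return None
    integer_part, _, fractional_part = value.partition(decimal)
    if thousand is not None:
        # A thousands separator after the decimal separator is rejected.
        if thousand in fractional_part:
            return None
        integer_part = integer_part.replace(thousand, "")
    text = f"{integer_part}.{fractional_part}" if fractional_part else integer_part
    try:
        return float(text)
    except ValueError:
        return None


print(parse_number("1,2.3,4", decimal=".", thousand=","))  # None
print(parse_number("1,2.34", decimal=".", thousand=","))  # 12.34
print(parse_number("1,234"))  # None
```

Note that this sketch still accepts 1,2.34 as 12.34, which is the ambiguity raised above. Rejecting it requires checking that thousands separators delimit groups of three digits.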

Some additional comments:

  • While it's impossible to test all possible scenarios, tests should also include as many weird edge cases as we can come up with to see what could be the result.
  • The ToFloat docstring needs some more work to explain in more detail the behavior when decimal and thousands are set.

I'll convert this back to draft and keep an eye on this for the next PR.

@rcap107 rcap107 marked this pull request as draft December 2, 2025 13:54
@gabrielapgomezji
Contributor Author

When talking about the tests, it was mentioned to include 3 tests:

  • A test for good inputs
  • A test for bad inputs
  • A test for bad parameters

I merged the last two, so the bad-inputs test also covers bad parameters. If it's better to have the 3 tests individually instead of the 2, I will modify it.

@rcap107 rcap107 marked this pull request as ready for review December 17, 2025 13:00
Comment on lines +65 to +68
During ``fit``, |ToFloat| attempts to convert all values in the column to
numeric values after automatically removing other possible thousands separators
(``,``, ``.``, space, apostrophe). If any value cannot be converted, the column
is rejected with a ``RejectColumn`` exception.
Member

I think this is not how the current version of the code is working: the regex pattern should reject anything that contains characters different from either the decimal or thousands separator.

There should also be an explanation of how the check is done (checking if there are parentheses, checking if thousands are separated by groups of 3 digits, adding the scientific notation)
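To make the checks listed here concrete, the following is one possible shape for such a pattern, written with the standard `re` module. The function `build_number_regex` is hypothetical and is not the regex used in the PR. It also assumes that parentheses wrap a whole number, as in accounting notation.

```python
import re


def build_number_regex(decimal=".", thousand=None):
    """Compile a pattern for a single number (hypothetical sketch)."""
    dec = re.escape(decimal)
    if thousand:
        # 1 to 3 leading digits followed by groups of exactly 3 digits,
        # or a plain run of digits without any thousands separator.
        integer = rf"(?:\d{{1,3}}(?:{re.escape(thousand)}\d{{3}})+|\d+)"
    else:
        integer = r"\d+"
    # Digits with an optional fraction, or a bare fraction such as ",56".
    mantissa = rf"(?:{integer}(?:{dec}\d*)?|{dec}\d+)"
    # Optional scientific notation, e.g. "e3" or "E-2".
    exponent = r"(?:[eE][+-]?\d+)?"
    # Either an optionally signed number or a parenthesized one.
    return re.compile(rf"(?:[+-]?{mantissa}{exponent}|\({mantissa}{exponent}\))")


pattern = build_number_regex(decimal=".", thousand=",")
print(bool(pattern.fullmatch("1,234.56")))  # True
print(bool(pattern.fullmatch("1,,234")))  # False
print(bool(pattern.fullmatch("1.23,45")))  # False
```

Using `fullmatch` matters here: any character outside the digits, the two separators, the sign, the exponent marker, and the parentheses makes the whole string fail.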

("1,,234", ".", ","),
("1.23,45", ".", ","),
# decimal == thousand
("123,456,789", ",", ","),
Member

Here we are testing that RejectColumn is raised as expected when it encounters values that should not be converted. This case should be moved to a separate test that verifies that the correct exception is raised if the parameters are incorrect. The same (new) test should also check that a ValueError is raised if decimal is None.
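A small sketch of what the separate parameter check could look like, independent of skrub. The helper `check_separators` is hypothetical. Raising ValueError when decimal is None comes from the comment above, while raising ValueError when both separators are equal is an assumption.

```python
def check_separators(decimal, thousand=None):
    """Hypothetical parameter validation, separate from value parsing."""
    if decimal is None:
        raise ValueError("decimal must be a string, got None")
    # Assumption: identical separators are a parameter error (ValueError),
    # not a reason to reject the column.
    if decimal == thousand:
        raise ValueError("decimal and thousand separators must be different")


def raises_value_error(**kwargs):
    """Return True if check_separators rejects the given parameters."""
    try:
        check_separators(**kwargs)
    except ValueError:
        return True
    return False


print(raises_value_error(decimal=None))  # True
print(raises_value_error(decimal=",", thousand=","))  # True
print(raises_value_error(decimal=",", thousand="."))  # False
```

Keeping this apart from the RejectColumn cases means a test for bad parameters never depends on the content of the column.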

rcap107 commented Dec 17, 2025

Thanks a lot for the PR @gabrielapgomezji! This will be very useful for parsing data that is not in the usual locale.

My comments are mostly about improving clarity in the documentation and adding comments in the code. I think the actual content of the PR is in good shape; it's just a matter of polishing at this point.

Comment on lines 279 to 280
if self.thousand is None:
self.thousand = "" # No thousand separator
Member

This should be moved to the init, parameters should not be modified in the fit

ggomezji and others added 10 commits January 19, 2026 15:08
…ical.rst

Co-authored-by: Riccardo Cappuzzo <7548232+rcap107@users.noreply.github.com>
@rcap107 rcap107 force-pushed the 1728-ToFloat_improvement branch from 5e1dd9b to 095f403 Compare January 20, 2026 10:55
def _str_is_valid_number_polars(col, number_re):
# Check if all values in the column match the number regex.
# - Fill NaN values with empty string to avoid match errors.
# - Use `str.match` with `na=False` to treat empty/missing values as non-matching.
Member

Suggested change
- # - Use `str.match` with `na=False` to treat empty/missing values as non-matching.
+ # - Use `str.contains` with `literal=False` to treat empty/missing values as non-matching.

@rcap107 rcap107 left a comment

A few more cosmetic fixes, but I think we're almost done here. Thanks @gabrielapgomezji

Comment on lines +119 to +120
strings to floats. Other possible decimal separators are removed from
the strings before conversion.
Member

not the case anymore

Suggested change
- strings to floats. Other possible decimal separators are removed from
- the strings before conversion.
+ strings to floats.

1 12300.0
Name: x, dtype: float32
It is possible to specify the thousands separator, e.g., to use " "
Member

Suggested change
- It is possible to specify the thousands separator, e.g., to use " "
+ It is possible to specify the thousands separator, e.g., to use ``" "``

It is possible to specify the thousands separator, e.g., to use " "
>>> s = pd.Series(["4 567,89", "12 567,89"], name="x")
>>> ToFloat(decimal=",", thousand=" ").fit_transform(s) # doctest: +ELLIPSIS
Member

ELLIPSIS is enabled by default

Suggested change
- >>> ToFloat(decimal=",", thousand=" ").fit_transform(s) # doctest: +ELLIPSIS
+ >>> ToFloat(decimal=",", thousand=" ").fit_transform(s)

rcap107 commented Jan 20, 2026

I did a very quick and dirty benchmark comparing the performance of the ToFloat transformer in the latest version of skrub, and the ToFloat in this PR.

I generated a dataframe with 10M rows and 30 columns.

There is a small but noticeable difference in time when there is no conversion to be done, so we might want to skip the formatting check in the default case (decimal="." and thousand=None).

skrub version: 0.8.dev0
Elapsed time for ToFloat transformation: 0.3026 seconds
skrub version: 0.7.1
Elapsed time for ToFloat transformation: 0.2914 seconds
Code
import time
import polars as pl
import numpy as np
import skrub
from importlib import metadata
import polars.selectors as cs

version = metadata.version("skrub")
print(f"skrub version: {version}")

# Set random seed for reproducibility
np.random.seed(42)

# Generate random float data
data = np.random.uniform(1000, 1000000, size=(10_000_000, 30))

# Build the dataframe directly from the floats: the columns are already
# numeric, so this measures the case where no conversion is needed
df = pl.DataFrame(data)

from skrub import ToFloat, ApplyToCols
def benchmark_tofloat(df):
    tic = time.time()
    transformer = ApplyToCols(ToFloat())
    transformed_df = transformer.fit_transform(df)
    toc = time.time()

    return toc - tic

# Run benchmark
times = []
for run in range(100):
    elapsed_time = benchmark_tofloat(df)
    times.append(elapsed_time)

elapsed_time = np.median(times)
print(f"Elapsed time for ToFloat transformation: {elapsed_time:.4f} seconds")
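The shortcut suggested in this comment for the default case could look roughly like the sketch below. Both function names are hypothetical, and the slow path is only a stand-in for the real format check and separator handling.

```python
def uses_default_format(decimal=".", thousand=None):
    """True when strings need no separator handling before conversion."""
    return decimal == "." and thousand is None


def to_floats(values, decimal=".", thousand=None):
    """Convert strings to floats, skipping the extra work in the default case."""
    if uses_default_format(decimal, thousand):
        # Fast path: no format check and no string rewriting.
        return [float(v) for v in values]
    converted = []
    for v in values:
        # Slow path (stand-in): strip the thousands separator, then
        # replace the decimal separator with "." before converting.
        if thousand is not None:
            v = v.replace(thousand, "")
        converted.append(float(v.replace(decimal, ".")))
    return converted


print(to_floats(["1.5", "2"]))  # [1.5, 2.0]
print(to_floats(["4 567,89"], decimal=",", thousand=" "))  # [4567.89]
```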

Development

Successfully merging this pull request may close these issues.

ToFloat fails when trying to parse numbers with "," decimal separators

4 participants